NLTK is the Natural Language Toolkit, a fairly large Python library for many sorts of linguistic analysis of text. NLTK comes with a selection of sample texts that we'll use today to get familiar with the kinds of analysis you can do.
To run this notebook you will need the nltk, matplotlib, and tkinter modules. If you are new to Python and programming, the easiest way to get all of these is to use the Anaconda Python distribution, which includes them along with a whole host of other useful libraries. You can check whether you have the libraries by running the following commands in a Terminal or PowerShell window:
python -c 'import nltk'
python -c 'import matplotlib'
python -c 'import tkinter'
If you don't have NLTK, you can install it using the pip command (or possibly pip3 if you're on a Mac) as usual.
pip install nltk
If you don't have Matplotlib or TkInter, and don't want to download Anaconda, you will be able to follow along with most but not all of this notebook.
Once all this package installation work is done, you can run
python -c 'import nltk; nltk.download()'
or, if you are on a Mac with Python 3.4 installed via the standard Python installer:
python3 -c 'import nltk; nltk.download()'
and use the dialog that appears to download the 'book' package.
We will start by loading the example texts in the 'book' package that we just downloaded.
In [1]:
from nltk.book import *
This import statement reads the book samples, which include nine sentences and nine book-length texts. It has also helpfully put each of these texts into a variable for us, from sent1 to sent9 and text1 to text9.
In [2]:
print(sent1)
print(sent3)
print(sent5)
Let's look at the texts now.
In [3]:
print(text6)
print(text6.name)
print("This text has %d words" % len(text6.tokens))
print("The first hundred words are:", " ".join( text6.tokens[:100] ))
Each of these texts is an nltk.text.Text object, and has methods to let you see what the text contains. But you can also treat it as a plain old list!
In [4]:
print(text5[0])
print(text3[0:11])
print(text4[0:51])
We can do simple concordancing, printing the context for each use of a word throughout the text:
In [5]:
text6.concordance( "swallow" )
The default is to show no more than 25 results for any given word, but we can change that.
In [6]:
text6.concordance('Arthur', lines=37)
We can adjust the amount of context we show in our concordance:
In [7]:
text6.concordance('Arthur', width=100)
...or get the number of times any individual word appears in the text.
In [8]:
word_to_count = "KNIGHT"
print("The word %s appears %d times." % ( word_to_count, text6.count( word_to_count ) ))
We can generate a vocabulary for the text, and use the vocabulary to find the most frequent words as well as the ones that appear only once (a.k.a. the hapaxes).
In [9]:
t6_vocab = text6.vocab()
t6_words = list(t6_vocab.keys())
print("The text has %d different words" % ( len( t6_words ) ))
print("Some arbitrary 50 of these are:", t6_words[:50])
print("The most frequent 50 words are:", t6_vocab.most_common(50))
print("The word swallow appears %d times" % ( t6_vocab['swallow'] ))
print("The text has %d words that appear only once" % ( len( t6_vocab.hapaxes() ) ))
print("Some arbitrary 100 of these are:", t6_vocab.hapaxes()[:100])
You've now seen two methods for getting the number of times a word appears in a text: text6.count(word) and t6_vocab[word]. These are in fact identical, and the following bit of code just proves that. An assert statement is used to test whether something is true - if it ever isn't, the code will raise an error. This is a basic building block for writing tests for your code.
In [10]:
print("Here we assert something that is true.")
for w in t6_words:
    assert text6.count( w ) == t6_vocab[w]
print("See, that worked! Now we will assert something that is false, and we will get an error.")
for w in t6_words:
    assert w.lower() == w
We can try to find interesting words in the text, such as words of a minimum length (the longer a word, the less common it probably is) that occur more than once or twice...
In [11]:
# With a list comprehension
long_words = [ w for w in t6_words if len( w ) > 5 and t6_vocab[w] > 3 ]
# The long way, with a for loop. This is identical to the above.
long_words = []
for w in t6_words:
    if( len( w ) > 5 and t6_vocab[w] > 3 ):
        long_words.append( w )
print("The reasonably frequent long words in the text are:", long_words)
And we can look for pairs of words that go together more often than chance would suggest.
In [12]:
print("\nUp to twenty collocations")
text6.collocations()
print("\nUp to fifty collocations")
text6.collocations(num=50)
print("\nCollocations that might have one word in between")
text6.collocations(window_size=3)
NLTK can also provide us with a few simple graph visualizations, when we have matplotlib installed. To make this work in iPython, we need the following magic line. If you are running in PyCharm, then you do not need this line - it will throw an error if you try to use it!
In [13]:
%pylab --no-import-all inline
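If the %pylab magic complains in your version of Jupyter or IPython (it has been deprecated for a while), the narrower %matplotlib magic should work instead; this is a suggestion for newer setups rather than part of the original instructions:
%matplotlib inline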
The vocabulary we get from the .vocab() method is something called a "frequency distribution", which means it's a giant tally of each unique word and the number of times that word appears in the text. We can also make a frequency distribution of other features, such as "each possible word length and the number of times a word of that length is used". Let's do that and plot it.
In [14]:
word_length_dist = FreqDist( [ len(w) for w in t6_vocab.keys() ] )
word_length_dist.plot()
We can plot where in the text a word occurs, and compare it to other words, with a dispersion plot. For example, the following dispersion plots show respectively (among other things) that the words 'coconut' and 'swallow' almost always appear in the same part of the Holy Grail text, and that Willoughby and Lucy do not appear in Sense and Sensibility until some time after the beginning of the book.
In [15]:
text6.dispersion_plot(["coconut", "swallow", "KNIGHT", "witch", "ARTHUR"])
text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby", "Lucy"])
We can go a little crazy with text statistics. This block of code computes the average word length for each text, as well as the "lexical diversity" - a measure of how much word re-use there is in a text.
In [16]:
def print_text_stats( thetext ):
    # Average word length
    awl = sum([len(w) for w in thetext]) / len( thetext )
    # Lexical diversity: the average number of times each distinct word gets used
    ld = len( thetext ) / len( thetext.vocab() )
    print("%.2f\t%.2f\t%s" % ( awl, ld, thetext.name ))
all_texts = [ text1, text2, text3, text4, text5, text6, text7, text8, text9 ]
print("Wlen\tLdiv\tTitle")
for t in all_texts:
    print_text_stats( t )
So far we have been using the sample texts, but we can also use any text that we have lying around on our computer. The easiest sort of text to read in is plaintext - not PDF or HTML or anything else. Once we have made the text into an NLTK text with the Text() function, we can use all the same methods on it as we did for the sample texts above.
In [17]:
from nltk import word_tokenize
# You can read the file this way:
f = open('alice.txt', encoding='utf-8')
raw = f.read()
f.close()
# or you can read it this way.
with open('alice.txt', encoding='utf-8') as f:
    raw = f.read()
# Use NLTK to break the text up into words, and put the result into a
# Text object.
alice = Text( word_tokenize( raw ) )
alice.name = "Alice's Adventures in Wonderland"
print(alice.name)
alice.concordance( "cat" )
print_text_stats( alice )
In [18]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())
paradise_lost = Text( gutenberg.words( "milton-paradise.txt" ) )
paradise_lost
Paradise Lost is now a Text object, just like the ones we have worked on before. But we accessed it through the NLTK corpus reader, which means that we get some extra bits of functionality:
In [19]:
print("Length of text is:", len( gutenberg.raw( "milton-paradise.txt" )))
print("Number of words is:", len( gutenberg.words( "milton-paradise.txt" )))
assert( len( gutenberg.words( "milton-paradise.txt" )) == len( paradise_lost ))
print("Number of sentences is:", len( gutenberg.sents( "milton-paradise.txt" )))
print("Number of paragraphs is:", len( gutenberg.paras( "milton-paradise.txt" )))
We can also make our own corpus if we have our own collection of files, e.g. the Federalist Papers from last week. But we have to pay attention to how those files are arranged! In this case, if you look in the text files, the paragraphs are set apart with 'hanging indentation' - all the lines of a paragraph after the first are indented, so a new paragraph begins whenever a line starts with a letter rather than with whitespace. We tell the corpus reader about this below.
In [20]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import read_regexp_block
# Define how paragraphs look in our text files.
def read_hanging_block( stream ):
    return read_regexp_block( stream, "^[A-Za-z]" )
corpus_root = 'federalist'
file_pattern = r'federalist_.*\.txt'
federalist = PlaintextCorpusReader( corpus_root, file_pattern, para_block_reader=read_hanging_block )
print("List of texts in corpus:", federalist.fileids())
print("\nHere is the fourth paragraph of the first text:")
print(federalist.paras("federalist_1.txt")[3])
And just like before, from this corpus we can make individual Text objects, on which we can use the methods we have seen above.
In [21]:
fed1 = Text( federalist.words( "federalist_1.txt" ))
print("The first Federalist Paper has the following word collocations:")
fed1.collocations()
print("\n...and the following most frequent words.")
fed1.vocab().most_common(50)
In linguistics, stopwords or function words are words that are so frequent in a particular language that they say little to nothing about the meaning of a text. You can make your own list of stopwords, but NLTK also provides a list for each of several common languages. These sets of stopwords are provided as another corpus.
In [22]:
from nltk.corpus import stopwords
print("We have stopword lists for the following languages:")
print(stopwords.fileids())
print("\nThese are the NLTK-provided stopwords for the German language:")
print(", ".join( stopwords.words('german') ))
Now that we have the stopword list, we can use it to filter out vocabulary we don't want to see. Let's look at our 50 most frequent words in Holy Grail again.
In [23]:
print("The most frequent words are: ")
print([word[0] for word in t6_vocab.most_common(50)])
f1_most_frequent = [ w[0] for w in t6_vocab.most_common() if w[0].lower() not in stopwords.words('english') ]
print("\nThe most frequent interesting words are: ", " ".join( f1_most_frequent[:50] ))
Maybe we should get rid of punctuation and all-caps words too...
In [24]:
def is_interesting( w ):
    if( w.lower() in stopwords.words('english') ):
        return False
    if( w.isupper() ):
        return False
    return w.isalpha()
f1_most_frequent = [ w[0] for w in t6_vocab.most_common() if is_interesting( w[0] ) ]
print("The most frequent interesting words are: ", " ".join( f1_most_frequent[:50] ))
Quite frequently we might want to treat different forms of a word - e.g. 'make / makes / made / making' - as the same word. A common way to do this is to find the stem of each word and use that in your analysis, in place of the word itself. There are several different approaches to stemming; none of them is perfect, and quite frequently linguists will write their own stemmers.
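As a small taste of the 'write your own' approach, NLTK provides a RegexpStemmer that simply strips whatever suffixes you give it. The suffix pattern and example words below are made up for illustration:
from nltk.stem import RegexpStemmer
# Strip a few common English suffixes, but only from words at least four letters long.
toy_stemmer = RegexpStemmer('ing$|ed$|es$|s$', min=4)
for word in ['running', 'ran', 'swallows', 'coconuts', 'knight']:
    print(word, "->", toy_stemmer.stem(word))
Notice that 'running' comes out as 'runn' - crude suffix-stripping has its limits.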
Let's chop out a paragraph of Alice in Wonderland to play with.
In [25]:
my_text = alice[305:549]
print(" ". join( my_text ))
print(len( set( my_text )), "words")
NLTK comes with a few different stemming algorithms; we can also use WordNet (a system for analyzing semantic relationships between words) to look for the lemma form of each word and "stem" it that way. Here are some results.
In [26]:
from nltk import PorterStemmer, LancasterStemmer, WordNetLemmatizer
porter = PorterStemmer()
lanc = LancasterStemmer()
wnl = WordNetLemmatizer()
porterlist = [porter.stem(w) for w in my_text]
print(" ".join( porterlist ))
print(len( set( porterlist )), "Porter stems")
lanclist = [lanc.stem(w) for w in my_text]
print(" ".join( lanclist ))
print(len( set( lanclist )), "Lancaster stems")
wnllist = [ wnl.lemmatize(w) for w in my_text ]
print(" ".join( wnllist ))
print(len( set( wnllist )), "Wordnet lemmata")
NLTK can also tag each word with its (probable) part of speech, using the Penn Treebank tagset. Let's see what the default tagger makes of our paragraph; the most common tags are explained in the table below.
In [27]:
from nltk import pos_tag
print(pos_tag(my_text))
Tag | Meaning | Examples |
---|---|---|
JJ | adjective | new, good, high, special, big, local |
RB | adverb | really, already, still, early, now |
CC | coordinating conjunction | and, or, but |
DT | determiner | the, a, some, most, every, no |
EX | existential there | there |
FW | foreign word | dolce, ersatz, esprit, quo, maitre |
MD | modal verb | will, can, would, may, must, should |
NN | noun | year, home, costs, time, education |
NNP | proper noun | Alison, Africa, April, Washington |
CD | number | twenty-four, 1991, 14:24 |
PRP | pronoun | he, she, it, I, us, them |
IN | preposition or subordinating conjunction | on, of, at, with, by, if, although |
TO | the word to | to |
UH | interjection | ah, bang, ha, whee, hmpf, oops |
VB | verb (base form) | get, do, make, see, run |
VBD | verb (past tense) | said, took, told, made, asked |
VBG | present participle | making, going, playing, working |
VBN | past participle | given, taken, begun, sung |
WRB | wh-adverb | when, where, why, how |
Automated tagging is pretty good, but not perfect. There are other taggers out there, such as the Brill tagger and the TreeTagger, but these aren't set up to run 'out of the box' and, with TreeTagger in particular, you will have to download extra software.
Some of the bigger corpora in NLTK come pre-tagged; this is useful for training a tagger that uses machine-learning methods (such as Brill), and a good way to test any new tagging method that is developed. These corpora are also the data from which our knowledge of how language is used comes - at least, for English and some other major Western languages.
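For instance, here is a minimal sketch of how a pre-tagged corpus can be used to train a tagger - a simple unigram tagger rather than Brill, to keep things short; the choice of the 'news' category and the train/test split sizes are arbitrary:
from nltk import UnigramTagger
from nltk.corpus import brown
# Train on tagged sentences from the Brown 'news' category and evaluate on a held-out slice.
train_sents = brown.tagged_sents(categories='news')[:3000]
test_sents = brown.tagged_sents(categories='news')[3000:3500]
unigram_tagger = UnigramTagger(train_sents)
print("Accuracy on held-out sentences: %.2f" % unigram_tagger.evaluate(test_sents))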
In [28]:
from nltk.corpus import brown
print(brown.tagged_words()[:25])
print(brown.tagged_words(tagset='universal')[:25])
We can even do a frequency plot of the different parts of speech in the corpus (if we have matplotlib installed!)
In [29]:
tagged_word_fd = FreqDist([ w[1] for w in brown.tagged_words(tagset='universal') ])
tagged_word_fd.plot()
As well as the parts of speech of individual words, it is useful to be able to analyze the structure of an entire sentence. This generally involves breaking the sentence up into its component phrases, otherwise known as chunking.
We are not going to cover chunking in depth here, as there is no out-of-the-box chunker in NLTK: you are expected to define the grammar (or at least some approximation of it) yourself, and only once you have done that does chunking become possible. The sketch below gives a flavour of what that looks like.
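Here is a minimal example using NLTK's RegexpParser; the chunk grammar and the sentence are invented for illustration and are not part of the texts above:
from nltk import RegexpParser, pos_tag, word_tokenize
# A toy grammar: a noun phrase (NP) is an optional determiner, any number of
# adjectives, and then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = RegexpParser(grammar)
tagged = pos_tag(word_tokenize("The old king kept a very large swallow in the castle"))
print(chunker.parse(tagged))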
But one application of chunking is named-entity recognition - parsing a sentence to identify the named people, places, and organizations in it. This is more difficult than it looks: words like "Yankee", "May", and "North" may or may not be parts of names, depending on context.
Here's how to do it. We will use the example sentences that were loaded in sent1 through sent9 to try it out. Notice the difference (in IPython only!) between printing the result and just looking at the result - and if you try to show the graph for more than one sentence at a time, you'll be waiting a long time, so don't try it.
In [30]:
from nltk import ne_chunk
tagged_text = pos_tag(sent2)
ner_text = ne_chunk( tagged_text )
print(ner_text)
ner_text
Here is a function that takes the result of ne_chunk (the plain-text form, not the graph form!) and spits out only the named entities that were found.
In [31]:
def list_named_entities( tree ):
    try:
        tree.label()
    except AttributeError:
        return
    if( tree.label() != "S" ):
        print(tree)
    else:
        for child in tree:
            list_named_entities( child )

list_named_entities( ner_text )
And there you have it - an introductory tour of what is probably the best available code toolkit for natural language processing. If this sort of thing interests you, there is an entire book-length tutorial: the NLTK Book, Natural Language Processing with Python, available free online at http://www.nltk.org/book/.
Have fun!